Feature extraction and classification algorithms based on nonlinear kernel
methods are an active research direction in machine learning. This paper
considers their practical application in an iTaukei automatic speaker
recognition (ASR) system for cross-language speech recognition. Nonlinear
speaker-specific feature extraction methods, namely kernel principal
component analysis (KPCA), kernel independent component analysis (KICA),
and kernel linear discriminant analysis (KLDA), are summarized.
The effect of these transformations on subsequent classification was tested
in conjunction with Gaussian mixture modeling (GMM) learning algorithms; in
most cases, the transformations were found to have a beneficial effect on
classification performance. The best results were achieved by the KLDA
algorithm. The performance of the ASR system is evaluated on speech ranging
from clear recordings to a wide range of speech quality, using the ATR
Japanese C language corpus and a self-recorded iTaukei corpus. For 6 sec of
speech from the ATR Japanese C language corpus, the ASR efficiency of the
KLDA, KICA, and KPCA techniques is 99.7%, 99.6%, and 99.1%, and the equal
error rate (EER) is 1.95%, 2.31%, and 3.41%, respectively. The EER
improvement of the KLDA technique-based ASR system compared with KICA and
KPCA is 4.25% and 8.51%, respectively.


This is an open access article under the CC BY-SA license. 





Corresponding Author:
Satyanand Singh,
School of Electrical and Electronics Engineering,
Fiji National University, Fiji.
Email: satyanand.singh@fnu.ac.fj








1. INTRODUCTION 

ASR is conventionally implemented with statistical modeling techniques such as GMM or artificial neural
network (ANN) modeling. In the past few years, however, machine learning theory has produced a variety of
new algorithms for learning and classification. The so-called kernel-based methods, in particular, have recently
become a promising new research direction. Kernel-based classification and regression techniques such as
the well-known support vector machine (SVM) have found their way into speech technology fairly slowly,
perhaps because applying them to tasks as large as speech recognition requires both theoretical and practical
problems to be solved. Recently, however, more and more authors have turned their attention to the use of
support vector machines in speech recognition [1].

Besides using kernel-based classifiers, an alternative is to use kernel-based techniques only to
transform the feature space and to leave the classification job to more conventional methods. The purpose of
this paper is to study the applicability of some of these methods to phoneme classification, using kernel-based
speaker-specific feature extraction applied before learning to improve ASR classification rates. This paper
mainly discusses KPCA [2], KICA [3], and KLDA.
A traditional ASR process usually consists of two phases: a training phase and a test phase.
In the training phase, the system extracts speaker-specific characteristics from the speech signal and uses them
to create a speaker model [1]; the aim of the test phase is to determine which speaker from the training set
a spoken sample belongs to. As in all audio signal processing, the original speech signal is first transformed
into a feature vector representation [2]. Linear prediction cepstral coefficients (LPCC), perceptual linear
prediction cepstral coefficients (PLPCC), and mel-frequency cepstral coefficients (MFCC) [3] are the approaches
most commonly used in ASR systems to obtain speaker-specific features. For modeling, discriminative
classifiers such as the support vector machine (SVM) [4] have achieved impressive results in many ASR
systems. An SVM can effectively learn non-linear decision boundaries that separate target speakers from
imposters.

Although these feature extraction techniques are effective, a non-linear mapping of the speech features
to a new, suitable space may generate new features that better discriminate between speech categories.
Kernel-based technology has been applied to a variety of learning machines, including support vector machines
(SVM), kernel discriminant analysis (KDA), and kernel principal component analysis (KPCA) [5]. The latter two
methods are widely used in image recognition. Their performance in speaker recognition, however, has not been
carefully investigated.

The purpose of this paper is to examine the applicability of some of these methods to phoneme
classification, using kernel-based feature extraction methods applied before learning to boost classification rates.
Essentially, this paper deals with KPCA, KICA [6, 7], KLDA [5], and kernel springy discriminant analysis
(KSDA) [8]. In this work, KPCA, KICA, and KLDA are used for speaker-specific feature extraction in an ASR
system. With KPCA, speaker-specific features can be expressed in a high-dimensional space, which can
potentially generate more distinguishable speaker features.


2. FUNDAMENTALS OF PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a very common method of dimensionality reduction and
feature extraction. PCA attempts to find a linear subspace of lower dimension than the original feature space
in which the new features have the largest variance [9]. Consider the data set $\{x_i\}$, $i = 1, 2, \ldots, N$,
where each $x_i$ is a $D$-dimensional vector. We project the data onto an $M$-dimensional subspace with $M < D$.
The projection is written as $y = Ax$, where $A = [u_1, u_2, \ldots, u_M]^T$ and $u_k^T u_k = 1$ for
$k = 1, 2, \ldots, M$. We want to maximize the variance of $\{y_i\}$, which is the trace of the covariance
matrix of $\{y_i\}$:

$$A^* = \arg\max_{A} \operatorname{tr}(S_y) \tag{1}$$

where

$$S_y = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})(y_i - \bar{y})^T \tag{2}$$

and

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i \tag{3}$$

Let $S_x$ be the covariance matrix of $\{x_i\}$. Since $\operatorname{tr}(S_y) = \operatorname{tr}(A S_x A^T)$,
using a Lagrange multiplier and taking the derivative gives

$$S_x u_k = \lambda_k u_k \tag{4}$$

which shows that $u_k$ is an eigenvector of $S_x$. Each $x_i$ can now be represented as

$$x_i = \sum_{k=1}^{D}(x_i^T u_k)\,u_k \tag{5}$$

and $x_i$ can be approximated by $\tilde{x}_i$, expressed as

$$\tilde{x}_i = \sum_{k=1}^{M}(x_i^T u_k)\,u_k \tag{6}$$

where $u_k$ is the eigenvector of $S_x$ corresponding to the $k$th largest eigenvalue. Standard PCA results for
the two-speaker audio data are shown in Figure 1 (a).
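
As an illustration of (1) to (6), the following is a minimal NumPy sketch of standard PCA by eigendecomposition of $S_x$. It is not the implementation used in the experiments: the function names `pca_fit` and `pca_project`, the synthetic data, and the explicit mean subtraction before projection are assumptions made for this example.

```python
import numpy as np

def pca_fit(X, M):
    """Return the data mean, the M leading eigenvectors of S_x (as columns), and their eigenvalues."""
    mean = X.mean(axis=0)
    Xc = X - mean
    S_x = Xc.T @ Xc / X.shape[0]            # covariance with the 1/N convention of (2)
    eigvals, eigvecs = np.linalg.eigh(S_x)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:M]   # keep the M largest, as in (6)
    return mean, eigvecs[:, order], eigvals[order]

def pca_project(X, mean, U):
    """Rows of the result are y_i = A x_i with A = U^T (after centering, an assumption here)."""
    return (X - mean) @ U

# Synthetic stand-in data: 200 samples, 5 dimensions with decreasing spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
mean, U, lam = pca_fit(X, M=2)
Y = pca_project(X, mean, U)
```

The variance of each projected coordinate equals the corresponding eigenvalue $\lambda_k$, so $\operatorname{tr}(S_y)$ is the sum of the $M$ largest eigenvalues of $S_x$.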
2.1. Kernel PCA methodology for dimensionality reduction in ASR system

Standard PCA only allows linear dimensionality reduction, so it is not very useful when
the data has a more complex structure that cannot be represented well in a linear subspace. Fortunately,
kernel PCA extends standard PCA to nonlinear dimensionality reduction [10]. Assume that
a set of observations $x_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, m$, is given. Consider a dot product space $F$
related to the input space by a map $\phi: \mathbb{R}^n \to F$, which may be non-linear. The feature space $F$
can have an arbitrarily large, in some cases infinite, dimension. Here, uppercase letters are used for elements
of $F$ and lowercase letters for elements of $\mathbb{R}^n$. Suppose we are working with centered data,
$\sum_{i=1}^{m}\phi(x_i) = 0$. In $F$, the covariance matrix has the form

$$C = \frac{1}{m}\sum_{j=1}^{m}\phi(x_j)\,\phi(x_j)^T \tag{7}$$

We look for eigenvalues $\lambda > 0$ and nonzero eigenvectors $V \in F\setminus\{0\}$ satisfying $CV = \lambda V$.
It is well known that all solutions $V$ with $\lambda \neq 0$ lie in the span of $\{\phi(x_i)\}_{i=1}^{m}$. This has
two consequences. First, we may consider the set of equations
$\langle\phi(x_k), CV\rangle = \lambda\langle\phi(x_k), V\rangle$ for all $k = 1, 2, \ldots, m$; second, there exist
coefficients $\alpha_i$, $i = 1, 2, \ldots, m$, such that $V = \sum_{i=1}^{m}\alpha_i\phi(x_i)$. Combining the two,
we get the dual representation of the eigenvalue problem:

$$\frac{1}{m}\sum_{i=1}^{m}\alpha_i\Big\langle\phi(x_k), \sum_{j=1}^{m}\phi(x_j)\langle\phi(x_j), \phi(x_i)\rangle\Big\rangle = \lambda\sum_{i=1}^{m}\alpha_i\langle\phi(x_k), \phi(x_i)\rangle \quad \text{for all } k = 1, 2, \ldots, m$$

Defining an $m \times m$ matrix by $K_{ij} = \langle\phi(x_i), \phi(x_j)\rangle$, this becomes
$K^2\alpha = m\lambda K\alpha$, where $\alpha$ denotes the column vector with entries
$\alpha_1, \alpha_2, \ldots, \alpha_m$.

Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the eigenvalues of $K$, let
$\alpha^1, \alpha^2, \ldots, \alpha^m$ be the corresponding eigenvectors, and let $\lambda_r$ be the last non-zero
eigenvalue. We normalize $\alpha^1, \alpha^2, \ldots, \alpha^r$ by requiring the corresponding vectors in $F$ to
be normalized, $\langle V^k, V^k\rangle = 1$ for all $k = 1, 2, \ldots, r$. Using $V = \sum_{i=1}^{m}\alpha_i\phi(x_i)$
and $K\alpha = m\lambda\alpha$, the normalization condition for $\alpha^1, \alpha^2, \ldots, \alpha^r$ can be
rewritten as follows:

$$1 = \sum_{i,j=1}^{m}\alpha_i^k\alpha_j^k\langle\phi(x_i), \phi(x_j)\rangle = \sum_{i,j=1}^{m}\alpha_i^k\alpha_j^k K_{ij} = \langle\alpha^k, K\alpha^k\rangle = \lambda_k\langle\alpha^k, \alpha^k\rangle \tag{8}$$

For the purpose of principal component extraction, we need to compute the projections onto
the eigenvectors $V^k$ in $F$, for $k = 1, 2, \ldots, r$. Let $y$ be a test point with image $\phi(y)$ in $F$. Then

$$\langle V^k, \phi(y)\rangle = \sum_{i=1}^{m}\alpha_i^k\langle\phi(x_i), \phi(y)\rangle \tag{9}$$

and $\langle V^k, \phi(y)\rangle$ is the nonlinear principal component corresponding to $\phi$.
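
The dual eigenvalue problem, the normalization (8), and the projection (9) can be sketched in a few lines of NumPy. This is an illustrative sketch only: `kpca_dual` and `kpca_project` are hypothetical helper names, and a linear kernel on centered inputs is assumed so that the centering assumption $\sum_i \phi(x_i) = 0$ holds and the result can be compared with standard PCA.

```python
import numpy as np

def kpca_dual(K, r):
    """Solve K alpha = (m * lambda) alpha for a centered Gram matrix K and normalize as in (8)."""
    mu, A = np.linalg.eigh(K)              # mu: eigenvalues of K, ascending
    order = np.argsort(mu)[::-1][:r]       # keep the r largest (assumed non-zero)
    mu, A = mu[order], A[:, order]
    alphas = A / np.sqrt(mu)               # enforces mu_k * <alpha^k, alpha^k> = 1
    return mu, alphas

def kpca_project(k_y, alphas):
    """Equation (9): k_y[i] = <phi(x_i), phi(y)>; returns the r nonlinear components of y."""
    return k_y @ alphas

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)                     # centered inputs, so linear-kernel features are centered
K = X @ X.T                                # linear kernel, K_ij = <x_i, x_j>
mu, alphas = kpca_dual(K, r=2)

y = rng.normal(size=3)                     # a test point
proj = kpca_project(X @ y, alphas)
```

With the linear kernel, `proj` agrees with the standard PCA projection of `y` up to the sign of each component, which is a convenient sanity check before switching to a nonlinear kernel.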


2.2. Computation of covariance matrix and dot product matrix by positioning on feature space 

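For the sake of simplicity, we have assumed that the observations are centered. This is easy to achieve
in the input space, but it is more difficult in $F$, because the mean of the observations mapped into $F$ cannot
be computed explicitly. Assume that any $\phi$ and any set of observations $x_1, x_2, \ldots, x_m$ are given,
and define $\bar{\phi} = \frac{1}{m}\sum_{i=1}^{m}\phi(x_i)$; then the points
$\tilde{\phi}(x_i) = \phi(x_i) - \bar{\phi}$ are centered. The above assumption therefore holds for these points,
and we define the covariance matrix and the dot product matrix
$\tilde{K}_{ij} = \langle\tilde{\phi}(x_i), \tilde{\phi}(x_j)\rangle$ in $F$. We arrive at the familiar eigenvalue
problem $m\tilde{\lambda}\tilde{\alpha} = \tilde{K}\tilde{\alpha}$, where $\tilde{\alpha}$ holds the expansion
coefficients of an eigenvector with respect to the centered points $\tilde{\phi}(x_i)$. Since the centered data
is not available, $\tilde{K}$ cannot be calculated explicitly, but it can be expressed through its non-centered
counterpart $K$:

$$\tilde{K}_{ij} = \langle\phi(x_i) - \bar{\phi}, \phi(x_j) - \bar{\phi}\rangle = K_{ij} - \frac{1}{m}\sum_{t=1}^{m}K_{it} - \frac{1}{m}\sum_{s=1}^{m}K_{sj} + \frac{1}{m^2}\sum_{s,t=1}^{m}K_{st}$$

A more compact expression is obtained with the vector $1_m = (1, \ldots, 1)^T$:

$$\tilde{K} = K - \frac{1}{m}K 1_m 1_m^T - \frac{1}{m}1_m 1_m^T K + \frac{1}{m^2}(1_m^T K 1_m)\,1_m 1_m^T$$

We can thus calculate $\tilde{K}$ from $K$ and solve the eigenvalue problem. For a test point $y$, the projection
of the centered $\phi$-image of $y$ onto the eigenvectors of the covariance matrix of the centered points gives
its coordinates [11]:

$$\langle\tilde{\phi}(y), \tilde{V}^k\rangle = \langle\phi(y) - \bar{\phi}, \tilde{V}^k\rangle = \sum_{i=1}^{m}\tilde{\alpha}_i^k\langle\phi(y) - \bar{\phi}, \phi(x_i) - \bar{\phi}\rangle$$
$$= \sum_{i=1}^{m}\tilde{\alpha}_i^k\Big\{K(y, x_i) - \frac{1}{m}\sum_{s=1}^{m}K(x_s, x_i) - \frac{1}{m}\sum_{s=1}^{m}K(y, x_s) + \frac{1}{m^2}\sum_{s,t=1}^{m}K(x_s, x_t)\Big\} \tag{10}$$

Introducing the vector

$$Z = \big(K(y, x_i)\big)_{m \times 1} \tag{11}$$

the projections can be written in matrix form as

$$\big(\langle\tilde{\phi}(y), \tilde{V}^k\rangle\big)_{1 \times r} = Z^T\tilde{V} - \frac{1}{m}1_m^T K\tilde{V} - \frac{1}{m}(Z^T 1_m)\,1_m^T\tilde{V} + \frac{1}{m^2}(1_m^T K 1_m)\,1_m^T\tilde{V}$$
$$= Z^T\Big(I_m - \frac{1}{m}1_m 1_m^T\Big)\tilde{V} - \frac{1}{m}1_m^T K\Big(I_m - \frac{1}{m}1_m 1_m^T\Big)\tilde{V}$$
$$= \Big(Z^T - \frac{1}{m}1_m^T K\Big)\Big(I_m - \frac{1}{m}1_m 1_m^T\Big)\tilde{V} \tag{12}$$

where $\tilde{V}$ is read as the $m \times r$ matrix whose columns are the coefficient vectors
$\tilde{\alpha}^1, \ldots, \tilde{\alpha}^r$.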

Note that KPCA uses the input variables only implicitly, because the algorithm formulates the dimensionality
reduction in feature space through kernel function evaluations. Therefore, KPCA is useful for nonlinear feature
extraction by dimensionality reduction, but it does not explain the extracted features in terms of the input
variables.
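
The centering of the Gram matrix and the test-point projection (12) can be illustrated with the following NumPy sketch. The helper names `rbf_kernel`, `centre_gram`, and `project_test_point`, the kernel width `c = 0.5`, and the random data are assumptions made for this example, not values taken from the experiments.

```python
import numpy as np

def rbf_kernel(A, B, c=0.5):
    """K[i, j] = exp(-c * ||A_i - B_j||^2) for row-wise samples in A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-c * np.maximum(sq, 0.0))

def centre_gram(K):
    """Centered Gram matrix: (I - 11^T/m) K (I - 11^T/m), the compact form of K-tilde."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    return J @ K @ J

def project_test_point(Z, K, alphas):
    """Equation (12): Z[i] = K(y, x_i); alphas holds the coefficient vectors column-wise."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    return (Z - K.mean(axis=0)) @ J @ alphas   # K.mean(axis=0) is the row vector (1/m) 1^T K

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))
K = rbf_kernel(X, X)
Kc = centre_gram(K)

mu, A = np.linalg.eigh(Kc)
idx = np.argsort(mu)[::-1][:3]                 # three leading components
alphas = A[:, idx] / np.sqrt(mu[idx])          # normalization as in (8)

y = rng.normal(size=(1, 4))                    # a test point
coords = project_test_point(rbf_kernel(y, X)[0], K, alphas)
```

Projecting a training point $x_j$ through `project_test_point` gives the same coordinates as row $j$ of `Kc @ alphas`, which confirms that (12) is consistent with the centered eigenvalue problem.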


3. KERNEL-BASED SPEAKER SPECIFIC FEATURE EXTRACTION AND ITS APPLICATION
IN ASR

Classification algorithms require the objects to be classified to be represented as points in
a multidimensional feature space. However, one can apply further vector space transformations to the initial
features before running the learning algorithm. There are two reasons for doing this: first, they can improve
classification performance, and second, they can reduce the dimensionality of the data. Both the selection of
the initial features and their transformation are sometimes dealt with in the literature under the title "feature
extraction". To avoid misunderstanding, this section covers only the latter, that is, the transformation of
the initial feature set into another one that will hopefully yield more effective, or at least faster, classification.
The approach to feature extraction may be either linear or nonlinear, but the kernel technique in some way
breaks down the barrier between the two. The key idea behind the kernel technique was originally presented
in [12] and was applied again in connection with the general-purpose SVM [13-15], followed by other
kernel-based methods.


3.1. Supplying input variable information into kernel PCA 

To obtain interpretability, we add supplementary information to the KPCA representation. We have
developed a procedure to project any given input variable onto the subspace spanned by the eigenvectors
$\tilde{V}^k = \sum_{i=1}^{m}\tilde{\alpha}_i^k\tilde{\phi}(x_i)$. We can regard our observations as realizations
of a random vector $X = (X_1, X_2, \ldots, X_n)$. To represent the prominence of the input variable $X_k$ in
the KPCA, we consider a set of points of the form $y = a + s e_k \in \mathbb{R}^n$, where $s \in \mathbb{R}$ and
$e_k = (0, \ldots, 1, \ldots, 0)$ has a 1 in the $k$th component and 0 elsewhere. Next, the projections of
the images $\phi(y)$ of these points onto the subspace spanned by the eigenvectors can be calculated. Taking (12)
into account, the induced curve in the eigenspace is given in matrix form by the row vector

$$\sigma(s)_{1 \times r} = \Big(Z_s^T - \frac{1}{m}1_m^T K\Big)\Big(I_m - \frac{1}{m}1_m 1_m^T\Big)\tilde{V} \tag{13}$$

where $Z_s$ is of the form (11). Furthermore, by projecting the tangent vector at $s = 0$, we can express
the direction of maximum change of $\sigma(s)$ associated with the variable $X_k$. In matrix form:

$$\frac{d\sigma}{ds}\Big|_{s=0} = \frac{dZ_s^T}{ds}\Big|_{s=0}\Big(I_m - \frac{1}{m}1_m 1_m^T\Big)\tilde{V} \tag{14}$$

where

$$\frac{dZ_s^T}{ds}\Big|_{s=0} = \Big(\frac{dZ_s^1}{ds}\Big|_{s=0}, \ldots, \frac{dZ_s^m}{ds}\Big|_{s=0}\Big)$$

and

$$\frac{dZ_s^i}{ds}\Big|_{s=0} = \frac{dK(y, x_i)}{ds}\Big|_{s=0} = \Big(\sum_{t=1}^{n}\frac{\partial K(y, x_i)}{\partial y_t}\frac{dy_t}{ds}\Big)\Big|_{s=0} = \sum_{t=1}^{n}\frac{\partial K(y, x_i)}{\partial y_t}\Big|_{y=a}\delta_t^k = \frac{\partial K(y, x_i)}{\partial y_k}\Big|_{y=a}$$

where $\delta_t^k$ is the Kronecker delta. For the radial basis kernel
$K(y, x_i) = \exp(-c\|y - x_i\|^2) = \exp\big(-c\sum_{t=1}^{n}(y_t - x_{it})^2\big)$ and the set of points
$y = a + s e_k \in \mathbb{R}^n$, this gives

$$\frac{dZ_s^i}{ds}\Big|_{s=0} = \frac{\partial K(y, x_i)}{\partial y_k}\Big|_{y=a} = -2cK(a, x_i)(a_k - x_{ik}) = -2cK(x_\beta, x_i)(x_{\beta k} - x_{ik})$$

where the last equality holds when $a$ is a training point, $a = x_\beta$. Thus, by applying (13), it is possible
to represent any given input variable locally in the KPCA plot. Furthermore, by using (14), it is possible to
represent the tangent vector associated with any given input variable at each sample point [16]. Therefore,
a vector field can be drawn on the KPCA plot indicating the direction of growth of a given variable.
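
The tangent vector (14) for the radial basis kernel can be sketched as follows. This is a sketch under stated assumptions: the helper names `rbf_row`, `curve_point`, and `tangent_at`, the width `c = 0.5`, and the random data are hypothetical, and `V` stands for the matrix of normalized coefficient vectors.

```python
import numpy as np

def rbf_row(y, X, c):
    """Vector Z of (11) for the radial basis kernel: Z[i] = exp(-c * ||y - x_i||^2)."""
    return np.exp(-c * np.sum((X - y) ** 2, axis=1))

def curve_point(y, X, K, V, c):
    """Equation (13): coordinates of phi(y) in the centered KPCA eigenspace."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    return (rbf_row(y, X, c) - K.mean(axis=0)) @ J @ V

def tangent_at(a, k, X, V, c):
    """Equation (14) with dZ^i/ds = -2 c K(a, x_i) (a_k - x_ik) for the radial basis kernel."""
    m = X.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    dZ = -2.0 * c * rbf_row(a, X, c) * (a[k] - X[:, k])
    return dZ @ J @ V

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
c = 0.5
K = np.exp(-c * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))

m = K.shape[0]
J = np.eye(m) - np.ones((m, m)) / m
mu, A = np.linalg.eigh(J @ K @ J)
V = A[:, ::-1][:, :2] / np.sqrt(mu[::-1][:2])   # two leading normalized coefficient vectors

a = X[0]                                        # evaluate at a training point, a = x_beta
t = tangent_at(a, 1, X, V, c)                   # growth direction of input variable k = 1
```

Evaluating `tangent_at` at every training point and drawing the resulting arrows on the two-dimensional KPCA plot gives the vector field described above; a central finite difference of `curve_point` along $e_k$ is a simple way to check the analytic derivative.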

There are some existing techniques to compute $z$ for specific kernels [17]. For a Gaussian kernel
$K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, $z$ must satisfy the following condition:

$$z = \frac{\sum_{i}\gamma_i\exp(-\|z - x_i\|^2 / 2\sigma^2)\,x_i}{\sum_{i}\gamma_i\exp(-\|z - x_i\|^2 / 2\sigma^2)} \tag{15}$$
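
Because $z$ appears on both sides of (15), one natural way to use the condition is as a fixed-point iteration. The sketch below assumes that the weights $\gamma_i$ are already available; the function name `fixed_point_z`, the stopping rule, and the toy data are assumptions for illustration and are not taken from [17].

```python
import numpy as np

def fixed_point_z(X, gamma, sigma, z0, n_iter=100, tol=1e-10):
    """Iterate z <- sum_i w_i x_i / sum_i w_i with w_i = gamma_i exp(-||z - x_i||^2 / (2 sigma^2))."""
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iter):
        w = gamma * np.exp(-np.sum((X - z) ** 2, axis=1) / (2.0 * sigma**2))
        denom = w.sum()
        if abs(denom) < 1e-12:
            break                          # update undefined; keep the last estimate
        z_new = (w @ X) / denom
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

# Toy example: two symmetric points with equal weights and a wide kernel.
X = np.array([[-1.0, 0.0], [1.0, 0.0]])
gamma = np.ones(2)
z = fixed_point_z(X, gamma, sigma=2.0, z0=[0.3, 0.2])
```

In this toy case the iteration settles at the midpoint of the two samples. In general the result depends on the starting point, and the iteration breaks down when the denominator of (15) approaches zero.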


Kernel PCA results for the two-speaker audio data are shown in Figure 1 (b).

Figure 1. Standard PCA and kernel PCA results for the two-speaker audio data;
(a) standard PCA and (b) kernel PCA


3.2. Application of kernel independent component analysis (KICA) 

Independent component analysis (ICA) is a general statistical approach that originally arose from
the study of blind source separation. Another application of ICA is unsupervised feature extraction, where
the aim is to transform the input data linearly into uncorrelated components along which the distribution of
the sample set is as non-Gaussian as possible [18]. The reasoning is that the data should be easier to classify
along these directions. This is in accordance with the most popular speech modeling technique, i.e. fitting
Gaussian mixtures on each class, which obviously assumes that Gaussian mixtures can approximate the class
distributions well. KICA extends this by assuming, on the contrary, that when all classes are fused,
the distribution is not Gaussian; thus, using non-Gaussianity as a heuristic for unsupervised feature extraction
will prefer those directions which separate the classes.

Several objective functions for the optimal selection of independent directions have been defined using
approximately equivalent approaches. The goal of the KICA algorithm itself is to find the optimum of such
an objective function [19]. Many iterative methods are available for performing KICA. Some of them require
preprocessing, i.e. centering and whitening, while others do not. Overall, experience shows that all of these
algorithms converge faster on centered and whitened data, even those that do not really need it [20].

Let us first examine how the centering and whitening preprocessing steps can be carried out in the kernel
feature space. To this end, let the inner product in $F$ be implicitly defined by the kernel function $\kappa$
with the associated transformation $\phi$.

Step one, centering in $F$: shift the data $\phi(x_1), \phi(x_2), \ldots, \phi(x_k)$ by its mean $E\{\phi(x)\}$
to obtain the data

$$\phi'(x_1) = \phi(x_1) - E\{\phi(x)\}$$
$$\phi'(x_2) = \phi(x_2) - E\{\phi(x)\} \tag{16}$$
$$\vdots$$
$$\phi'(x_k) = \phi(x_k) - E\{\phi(x)\}$$

Step two, whitening in $F$: transform the centered samples $\phi'(x_1), \phi'(x_2), \ldots, \phi'(x_k)$ via
an orthogonal transformation $Q$ into the vectors
$\hat{\phi}(x_1) = Q\phi'(x_1), \hat{\phi}(x_2) = Q\phi'(x_2), \ldots, \hat{\phi}(x_k) = Q\phi'(x_k)$ such that
the covariance matrix $\hat{C} = E\{\hat{\phi}(x)\hat{\phi}(x)^T\}$ is the identity matrix. Because standard
PCA, just like its kernel-based equivalent, converts the covariance matrix into a diagonal form in which
the diagonal elements are the eigenvalues of the data covariance matrix $E\{\phi'(x)\phi'(x)^T\}$, all that
remains is to transform each diagonal element into 1. Based on this observation, a slight modification of
the formulas given in the KPCA section yields the necessary whitening transformation [21]. If
$(a_1, \lambda_1), (a_2, \lambda_2), \ldots, (a_k, \lambda_k)$ with
$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k$ are the eigenpairs of $E\{\phi'(x)\phi'(x)^T\}$, then
the transformation matrix $Q$ takes the form
$[\lambda_1^{-1/2}a_1, \lambda_2^{-1/2}a_2, \ldots, \lambda_m^{-1/2}a_m]^T$. Kernel independent component
analysis results for the two-speaker audio data are shown in Figure 2 (a).
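
The centering and whitening steps can be carried out entirely through the Gram matrix. The following NumPy sketch is illustrative only: the function name `kernel_whiten`, the degree-2 polynomial kernel, and the synthetic data are assumptions, and the subsequent ICA iteration itself is not shown.

```python
import numpy as np

def kernel_whiten(K, n_components, eps=1e-10):
    """Return the whitened coordinates Q phi'(x_i) of the training samples, one row per sample."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J                              # step one: centering in F
    mu, U = np.linalg.eigh(Kc)
    order = np.argsort(mu)[::-1][:n_components]
    mu, U = mu[order], U[:, order]
    if np.any(mu <= eps):
        raise ValueError("too many components requested for this Gram matrix")
    alphas = U / np.sqrt(mu)                    # unit-norm eigenvectors a_k, in dual coordinates
    lam = mu / m                                # eigenvalues of the covariance matrix in F
    projections = Kc @ alphas                   # <a_k, phi'(x_i)> for every training sample
    return projections / np.sqrt(lam)           # step two: scale each direction by lambda^(-1/2)

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 3))
K = (1.0 + X @ X.T) ** 2                        # polynomial kernel, chosen only for illustration
whitened = kernel_whiten(K, n_components=3)
```

The sample covariance of `whitened` is the identity matrix up to rounding error, which is the property required of the data handed to the iterative ICA step.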


3.3. Application of kernel linear discriminant analysis (KLDA) 

LDA is a conventional, supervised method of extracting speaker-specific characteristics [19] that has
proven to be one of the most effective preprocessing techniques for classification. It has also long been used
in speech recognition [22]. The main goal of LDA is to find a new (not necessarily orthogonal) basis for
the data that provides the optimal class separation.

In KLDA we essentially follow the discussion of its linear counterpart, except that in this case
everything happens implicitly in the kernel feature space $F$. Assume again that a kernel function $\kappa$
with a feature map $\phi$ and a kernel feature space $F$ has been chosen. In order to define the transformation
matrix $A$ of KLDA, we first define the objective function $\tau: F \to \mathbb{R}$; because of the supervised
nature of this method, it depends not only on the sample data $X$ but also on the class indicator
$\mathcal{L}$. Let us define $\tau(V)$ as

$$\tau(V) = \frac{V^T B V}{V^T W V}, \quad V \in F\setminus\{0\} \tag{17}$$

where $B$ is the between-class scatter matrix, while $W$ is the within-class scatter matrix. Here,
the between-class scatter matrix $B$ shows the scatter of the class mean vectors $\mu_j$ around the overall
mean vector $\mu$:

$$B = \sum_{j=1}^{r}\frac{k_j}{k}(\mu_j - \mu)(\mu_j - \mu)^T, \quad \mu = \frac{1}{k}\sum_{i=1}^{k}\phi(x_i), \quad \mu_j = \frac{1}{k_j}\sum_{\mathcal{L}(i)=j}\phi(x_i) \tag{18}$$

where $r$ is the number of classes, $k$ the number of samples, and $k_j$ the number of samples with class
label $j$. The within-class scatter matrix $W$ represents the weighted average scatter of the covariance
matrices $C_j$ of the sample vectors with class label $j$:

$$W = \sum_{j=1}^{r}\frac{k_j}{k}C_j, \quad C_j = \frac{1}{k_j}\sum_{\mathcal{L}(i)=j}(\phi(x_i) - \mu_j)(\phi(x_i) - \mu_j)^T \tag{19}$$


Kernel linear discriminant analysis results for the two-speaker audio data are shown in Figure 2 (b).

Figure 2. Kernel independent component analysis and kernel linear discriminant
analysis results for the two-speaker audio data; (a) kernel independent component analysis
and (b) kernel linear discriminant analysis
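
The KLDA objective (17) with the scatter matrices (18) and (19) can be maximized in dual coordinates by writing $V = \sum_i \alpha_i\phi(x_i)$, so that $V^T B V$ and $V^T W V$ become quadratic forms in $\alpha$ built from the columns of the Gram matrix. The sketch below is one possible realization, not the authors' implementation: the function name `klda_fit`, the ridge term `reg` added to keep $W$ invertible, the radial basis kernel, and the synthetic two-class data are all assumptions.

```python
import numpy as np

def klda_fit(K, labels, n_components, reg=1e-6):
    """Maximize alpha^T B alpha / alpha^T W alpha, the dual form of (17)."""
    k = K.shape[0]
    mean_all = K.mean(axis=1)                   # dual coordinates of the overall mean mu
    B = np.zeros((k, k))
    W = np.zeros((k, k))
    for j in np.unique(labels):
        Kj = K[:, labels == j]                  # kernel columns of the samples in class j
        kj = Kj.shape[1]
        mean_j = Kj.mean(axis=1)                # dual coordinates of the class mean mu_j
        d = (mean_j - mean_all)[:, None]
        B += (kj / k) * (d @ d.T)               # between-class scatter, as in (18)
        Dj = Kj - mean_j[:, None]
        W += (kj / k) * (Dj @ Dj.T) / kj        # weighted class covariance, as in (19)
    W += reg * np.eye(k)                        # assumption: small ridge for invertibility
    L = np.linalg.cholesky(W)
    Linv = np.linalg.inv(L)
    vals, vecs = np.linalg.eigh(Linv @ B @ Linv.T)   # symmetric form of B alpha = tau W alpha
    order = np.argsort(vals)[::-1][:n_components]
    return Linv.T @ vecs[:, order], vals[order], B, W

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)), rng.normal(3.0, 1.0, size=(30, 2))])
labels = np.array([0] * 30 + [1] * 30)
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))

alphas, taus, B, W = klda_fit(K, labels, n_components=1)
scores = K @ alphas                             # KLDA feature of every training sample
```

The value `taus[0]` is the largest attainable ratio $\tau$ for the regularized problem, so no other coefficient vector $\alpha$ gives a larger quotient. The transformed features in `scores` can then be passed to a conventional classifier such as a GMM.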


4. EXPERIMENTAL SETUP 

To evaluate the efficiency of the kernel-based speaker-specific feature extraction techniques, an isolated
word recognition experiment was performed. The experiment uses 520 Japanese words from the ATR
Japanese C language set voice database, spoken by 80 speakers (40 male and 40 female). Audio samples of
10 iTaukei speakers were collected at random and under unfavourable conditions. The average duration of
the training samples was six seconds per speaker for all 10 speakers, and of the twenty utterances of each
speaker just one was used for training [23-27]. The remaining 19 voice samples from the corpus were used for
matching. The utterances for this investigation were recorded in one sitting for each speaker, and the text of
the utterances was selected at random by the speaker. The voice recordings comprise twenty utterances from
each of the male and female speakers, sampled at 16 kHz with 16 bits/sample.

Throughout the experiment, 10,400 utterances were used as training data and the remaining 31,200
utterances were used as test data. The sampling rate of the audio signal is 10 kHz. Twelve mel-cepstral
coefficients were extracted using 25.6 ms Hamming windows with 10 ms shifts [28-32]. The KPCA features were
extracted from 13 mel-cepstral coefficients, including the zeroth coefficient, together with their delta
(increment) and acceleration coefficients, giving 39-dimensional vectors. Around 1,000,000 frames were
available as training data in this experiment, and it is computationally infeasible to calculate the matrix $K$
with this amount of data. To reduce the number of frames, $N$ frames are randomly picked from the training
data. $N = 1024$ was chosen to make the system computationally feasible.
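
The frame subsampling step can be sketched as follows. With about $10^6$ frames, a dense Gram matrix in double precision would need roughly $10^{12} \times 8$ bytes (about 8 TB), whereas $N = 1024$ frames need $1024^2 \times 8$ bytes (8 MiB). The helper names `add_deltas` and `subsample_frames`, the use of `np.gradient` as a simple stand-in for the delta and acceleration computation, and the random stand-in cepstra are assumptions; the paper does not specify how the delta coefficients were computed.

```python
import numpy as np

def add_deltas(cepstra):
    """Stack static, delta, and acceleration coefficients: (T, 13) -> (T, 39)."""
    delta = np.gradient(cepstra, axis=0)        # assumption: simple finite-difference deltas
    accel = np.gradient(delta, axis=0)
    return np.hstack([cepstra, delta, accel])

def subsample_frames(frames, n, seed=0):
    """Pick n distinct frames uniformly at random so that the n x n Gram matrix stays small."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(frames.shape[0], size=n, replace=False)
    return frames[idx]

rng = np.random.default_rng(4)
cepstra = rng.normal(size=(5000, 13))           # stand-in for real mel-cepstral frames
features = add_deltas(cepstra)
subset = subsample_frames(features, 1024)
gram_bytes = subset.shape[0] ** 2 * 8           # memory of a float64 Gram matrix on the subset
```

The kernel matrix $K$ for KPCA, KICA, or KLDA is then computed on `subset` only, while all frames can still be projected through kernel evaluations against the selected frames.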

Table 1 gives the efficiency and EER of the ASR system for KLDA, KICA, and KPCA
on the ATR Japanese C language corpus. Table 2 gives the efficiency and EER of the ASR system for KLDA,
KICA, and KPCA for the 10 iTaukei speakers in the cross-language setting. Figure 3 shows the equal error rate
(EER) of the KLDA, KICA, and KPCA based modeling techniques. The ASR efficiency of the KLDA, KICA, and
KPCA based modeling techniques is 99.9%, 99.6%, and 98.1%, and the EER is 4.7%, 4.9%, and 5.1%,
respectively, for 6 sec of audio signal. The EER improvement of the KLDA technique-based ASR system
compared with KICA and KPCA is 4.25% and 8.51%, respectively.
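
The quoted improvement figures appear to correspond to the EER gap expressed relative to the EER of KLDA, using the values 4.7%, 4.9%, and 5.1%. This reading is inferred from the numbers and is not stated explicitly in the text; the function name `eer_gap_percent` is introduced only for this check.

```python
def eer_gap_percent(eer_klda, eer_other):
    """EER gap of another technique over KLDA, in percent of the KLDA EER (assumed definition)."""
    return 100.0 * (eer_other - eer_klda) / eer_klda

gap_kica = eer_gap_percent(4.7, 4.9)   # approximately 4.255
gap_kpca = eer_gap_percent(4.7, 5.1)   # approximately 8.511
```

Under this assumed definition the two values match the reported 4.25% and 8.51% to within rounding.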
Figure 3. EER of the KLDA, KICA, and KPCA techniques for 6 sec of voice data;
(a) ATR Japanese C language and (b) iTaukei speakers' cross language
Table 1. Efficiency and EER of the ASR system for KLDA, KICA, and KPCA
for ATR Japanese C language

            KLDA                       KICA                       KPCA
Duration    Efficiency (%)  EER (%)    Efficiency (%)  EER (%)    Efficiency (%)  EER (%)
6 sec       99.7            1.95       99.6            2.31       99.1            3.41
4 sec       99.5            2.29       99.1            3.20       98.2            4.11
2 sec       98.8            3.23       98.3            4.32       97.6            5.3





Table 2. Efficiency and EER of the ASR system for KLDA, KICA, and KPCA
for 10 iTaukei speakers cross language

            KLDA                       KICA                       KPCA
Duration    Efficiency (%)  EER (%)    Efficiency (%)  EER (%)    Efficiency (%)  EER (%)
6 sec       94.9            2.04       94.6            3.4        94.1            4.1
4 sec       94.3            2.34       94.1            3.7        93.5            4.8
2 sec       93.5            3.20       93.1            4.1        92.6            5.3





5. CONCLUSION 

The performance of the ASR system has been evaluated experimentally on 6 sec of voice data from
the ATR Japanese C language corpus. For the 10,400 voice samples of the ATR Japanese C language corpus,
the speaker recognition accuracy is 99.7%, 99.6%, and 99.1% and the equal error rate (EER) is 1.95%, 2.31%,
and 3.41% for KLDA, KICA, and KPCA, respectively. The EER improvement of the KLDA technique-based ASR
system compared with KICA and KPCA is 4.25% and 8.51%, respectively. We find that non-linear transformations
usually lead to better classification than linear transformations, and are therefore a promising new research
direction. We also found that supervised transformations are usually stronger than unsupervised ones.
We think it would be worth searching for other supervised approaches that could be built similarly to
the KLDA or KICA-based ASR methodology. Such transformations significantly improved
the phonological knowledge ASR training framework by providing a comprehensive and accurate
classification of the contextual speech features unique to real-time speakers.